library(topGO)
library(knitr) # to use kable()
library(limma) # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')
Introduction
This is the enrichment analysis for isoforms regulated by regime. Because I quantified the association of expression with regime in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with isoforms ordered by the proportion of expression variance explained by regime. Note that some isoforms with low variance may still be highly associated with regime, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of isoforms based on the significance of the differential expression between levels of regime, which does depend on fold change.
Reading the data
Functional annotation is in 2019-07-26. I will also upload two lists of isoforms, with either proportion of variance explained by regime or p-value of differential expression test.
tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))
To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the isoforms identifiers and the numeric values, either the portion of variance explained by regime or the p-values of the differential expression test. The structure() function adds attributes to an object.
Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues <- structure(tagPValue[,1], names = row.names(tagPValue))
rm(tagVariance, tagPValue)
There are 33037 isoforms scored with a variance portion and a differential expression p-value. It should actually be the exact same isoforms. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.
head(annotation)
## tagname
## 1 TCONS_00000002
## 2 TCONS_00000010
## 3 TCONS_00000011
## 4 TCONS_00000016
## 5 TCONS_00000017
## 6 TCONS_00000018
## goterms
## 1 GO:0008417|GO:0016020|GO:0006486
## 2 GO:0043130|GO:0005515|GO:0043161
## 3 GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 6 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)
There are 19735 isoforms with GO annotations. But the differential expression analysis includes many more isoforms. In order to include the not-annotated isoforms in the enrichment analysis, to see if annotated or not annotated isoforms are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.
for (tag in unique(c(names(PValues), names(Variance)))) {
if (! tag %in% names(allgenes2GO)) {
allgenes2GO <- append(allgenes2GO,
structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
}
}
Using differential expression p-values
Building the topGO object
Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.
I generate three datasets, to analyse each of the three ontologies.
selection <- function(allScores) {return(allScores < 0.05)}
GOdata.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
| BP |
26235 |
1582 |
| MF |
30580 |
791 |
| CC |
21616 |
390 |
Running the tests
There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.
BP.elim <- runTest(GOdata.BP, algorithm = "elim", statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea <- runTest(GOdata.BP, algorithm = "lea", statistic = "ks")
MF.elim <- runTest(GOdata.MF, algorithm = "elim", statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea <- runTest(GOdata.MF, algorithm = "lea", statistic = "ks")
CC.elim <- runTest(GOdata.CC, algorithm = "elim", statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea <- runTest(GOdata.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
| BP |
elim |
1582 |
19 |
| BP |
weight01 |
1582 |
19 |
| BP |
lea |
1582 |
25 |
| MF |
elim |
791 |
21 |
| MF |
weight01 |
791 |
19 |
| MF |
lea |
791 |
31 |
| CC |
elim |
390 |
13 |
| CC |
weight01 |
390 |
11 |
| CC |
lea |
390 |
21 |
rm(ResultsSummary)
Results
The topGO package does not pay much attention to what terms are significant because they are overrepresented and which ones are underrepresented among the most differentially expressed genes. I think it’s worth separating them, to facilitate the biological interpretation. Note that not all terms listed in the tables below are significant. The scores for the three methods (elim, weight01 and lea) are non-corrected p-values.
Biological process
orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms
BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
BP.all[BP.all$Significant > BP.all$Expected,],
caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
| 1 |
GO:0006508 |
proteolysis |
822 |
271 |
227.16 |
2 |
3.1e-05 |
2.1e-06 |
1.2e-05 |
| 2 |
GO:0007186 |
G protein-coupled receptor signaling pat… |
302 |
107 |
83.46 |
1 |
3.4e-06 |
2.3e-06 |
3.4e-06 |
| 3 |
GO:0060271 |
cilium assembly |
73 |
40 |
20.17 |
7 |
0.00036 |
2.3e-05 |
4.3e-07 |
| 4 |
GO:0006355 |
regulation of transcription, DNA-templat… |
621 |
198 |
171.61 |
3 |
4.2e-05 |
5.6e-05 |
4.2e-05 |
| 5 |
GO:0007154 |
cell communication |
1169 |
331 |
323.05 |
795 |
0.67468 |
0.00016 |
0.70887 |
| 6 |
GO:0060294 |
cilium movement involved in cell motilit… |
7 |
7 |
1.93 |
4 |
0.00018 |
0.00018 |
0.00018 |
| 7 |
GO:0000079 |
regulation of cyclin-dependent protein s… |
9 |
8 |
2.49 |
5 |
0.00030 |
0.00030 |
0.00030 |
| 8 |
GO:0006813 |
potassium ion transport |
103 |
42 |
28.46 |
8 |
0.00044 |
0.00049 |
0.00044 |
| 9 |
GO:0005991 |
trehalose metabolic process |
22 |
15 |
6.08 |
9 |
0.00105 |
0.00105 |
1.3e-05 |
| 10 |
GO:0035082 |
axoneme assembly |
18 |
13 |
4.97 |
6 |
0.00035 |
0.00226 |
0.00035 |
| 12 |
GO:0005992 |
trehalose biosynthetic process |
7 |
5 |
1.93 |
11 |
0.00315 |
0.00315 |
0.00315 |
| 14 |
GO:0006030 |
chitin metabolic process |
103 |
42 |
28.46 |
13 |
0.00596 |
0.00516 |
0.00596 |
| 15 |
GO:0005975 |
carbohydrate metabolic process |
360 |
121 |
99.49 |
104 |
0.06703 |
0.00642 |
0.05250 |
| 16 |
GO:0006367 |
transcription initiation from RNA polyme… |
20 |
10 |
5.53 |
15 |
0.00809 |
0.00809 |
0.00809 |
| 17 |
GO:0042246 |
tissue regeneration |
6 |
5 |
1.66 |
16 |
0.00815 |
0.00815 |
0.00815 |
| 18 |
GO:0006979 |
response to oxidative stress |
40 |
15 |
11.05 |
17 |
0.00892 |
0.00892 |
0.00892 |
| 19 |
GO:0031146 |
SCF-dependent proteasomal ubiquitin-depe… |
5 |
4 |
1.38 |
19 |
0.00988 |
0.00988 |
0.00988 |
kable(
BP.all[BP.all$Significant < BP.all$Expected,],
caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
| 11 |
GO:0044271 |
cellular nitrogen compound biosynthetic … |
1365 |
361 |
377.22 |
1484 |
0.99135 |
0.00255 |
0.97186 |
| 13 |
GO:0006520 |
cellular amino acid metabolic process |
208 |
57 |
57.48 |
502 |
0.43200 |
0.00321 |
0.61700 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
Definition = Definition(GOTERM[sigTerms]),
PValue=score(BP.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with',
VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
| GO:0006508 |
proteolysis |
The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. |
0.0000021 |
| GO:0007186 |
G protein-coupled receptor signaling pathway |
A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). |
0.0000023 |
| GO:0060271 |
cilium assembly |
The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. |
0.0000229 |
| GO:0006355 |
regulation of transcription, DNA-templated |
Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. |
0.0000562 |
| GO:0060294 |
cilium movement involved in cell motility |
Movement of cilia mediated by motor proteins that contributes to the movement of a cell. |
0.0001825 |
| GO:0000079 |
regulation of cyclin-dependent protein serine/threonine kinase activity |
Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. |
0.0003024 |
| GO:0006813 |
potassium ion transport |
The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. |
0.0004926 |
| GO:0005991 |
trehalose metabolic process |
The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. |
0.0010535 |
| GO:0035082 |
axoneme assembly |
The assembly and organization of an axoneme, the bundle of microtubules and associated proteins that forms the core of cilia (also called flagella) in eukaryotic cells and is responsible for their movements. |
0.0022620 |
| GO:0005992 |
trehalose biosynthetic process |
The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. |
0.0031545 |
| GO:0006030 |
chitin metabolic process |
The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0051557 |
| GO:0006367 |
transcription initiation from RNA polymerase II promoter |
Any process involved in the assembly of the RNA polymerase II preinitiation complex (PIC) at an RNA polymerase II promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter. The initiation phase includes PIC assembly and the formation of the first few bonds in the RNA chain, including abortive initiation, which occurs when the first few nucleotides are repeatedly synthesized and then released. Promoter clearance, or release, is the transition between the initiation and elongation phases of transcription. |
0.0080904 |
| GO:0042246 |
tissue regeneration |
The regrowth of lost or destroyed tissues. |
0.0081542 |
| GO:0006979 |
response to oxidative stress |
Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of oxidative stress, a state often resulting from exposure to high levels of reactive oxygen species, e.g. superoxide anions, hydrogen peroxide (H2O2), and hydroxyl radicals. |
0.0089181 |
| GO:0031146 |
SCF-dependent proteasomal ubiquitin-dependent protein catabolic process |
The chemical reactions and pathways resulting in the breakdown of a protein or peptide by hydrolysis of its peptide bonds, initiated by the covalent attachment of ubiquitin, with ubiquitin-protein ligation catalyzed by an SCF (Skp1/Cul1/F-box protein) complex, and mediated by the proteasome. |
0.0098752 |
I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.
showSigOfNodes(GOdata.BP, score(BP.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 166
## Number of Edges = 300
##
## $complete.dag
## [1] "A graph with 166 nodes."
This is just a example of the most significant GO term:
showGroupDensity(GOdata.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function
orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms
MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
MF.all[MF.all$Significant > MF.all$Expected,],
caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
| GO:0005509 |
calcium ion binding |
638 |
255 |
177.46 |
1 |
2.1e-16 |
2.1e-16 |
2.1e-16 |
| GO:0004930 |
G protein-coupled receptor activity |
251 |
96 |
69.82 |
2 |
9.3e-06 |
6.5e-05 |
1.2e-06 |
| GO:0043565 |
sequence-specific DNA binding |
221 |
77 |
61.47 |
5 |
0.00068 |
9.2e-05 |
0.00068 |
| GO:0008417 |
fucosyltransferase activity |
42 |
20 |
11.68 |
3 |
0.00039 |
0.00039 |
0.00039 |
| GO:0030248 |
cellulose binding |
8 |
7 |
2.23 |
4 |
0.00054 |
0.00054 |
0.00054 |
| GO:0003774 |
motor activity |
277 |
93 |
77.05 |
158 |
0.14077 |
0.00097 |
0.14077 |
| GO:0004181 |
metallocarboxypeptidase activity |
39 |
22 |
10.85 |
6 |
0.00109 |
0.00109 |
0.00109 |
| GO:0004555 |
alpha,alpha-trehalase activity |
15 |
10 |
4.17 |
7 |
0.00113 |
0.00113 |
0.00113 |
| GO:0008134 |
transcription factor binding |
33 |
11 |
9.18 |
18 |
0.00804 |
0.00123 |
0.00804 |
| GO:0005200 |
structural constituent of cytoskeleton |
23 |
14 |
6.40 |
8 |
0.00150 |
0.00150 |
0.00150 |
| GO:0004252 |
serine-type endopeptidase activity |
143 |
53 |
39.78 |
10 |
0.00264 |
0.00264 |
0.00264 |
| GO:0016462 |
pyrophosphatase activity |
952 |
290 |
264.80 |
214 |
0.24145 |
0.00295 |
0.65612 |
| GO:0008061 |
chitin binding |
112 |
42 |
31.15 |
12 |
0.00379 |
0.00379 |
0.00379 |
| GO:0004198 |
calcium-dependent cysteine-type endopept… |
62 |
22 |
17.25 |
13 |
0.00390 |
0.00390 |
0.00390 |
| GO:0004611 |
phosphoenolpyruvate carboxykinase activi… |
7 |
6 |
1.95 |
14 |
0.00435 |
0.00435 |
0.00435 |
| GO:0004601 |
peroxidase activity |
59 |
21 |
16.41 |
115 |
0.07881 |
0.00437 |
0.07881 |
| GO:0005267 |
potassium channel activity |
80 |
33 |
22.25 |
11 |
0.00367 |
0.00593 |
0.00367 |
| GO:0004983 |
neuropeptide Y receptor activity |
11 |
7 |
3.06 |
16 |
0.00594 |
0.00594 |
0.00594 |
| GO:0008047 |
enzyme activator activity |
135 |
41 |
37.55 |
105 |
0.07008 |
0.00805 |
0.07008 |
| GO:0030246 |
carbohydrate binding |
83 |
40 |
23.09 |
28 |
0.01442 |
0.01061 |
0.04768 |
kable(
MF.all[MF.all$Significant < MF.all$Expected,],
caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
| 21 |
GO:0140098 |
catalytic activity, acting on RNA |
278 |
68 |
77.33 |
505 |
0.73141 |
0.01160 |
0.73141 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MF.weight01)[sigTerms]),
caption = paste('Molecular function terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with regime according to all 3 algorithms
| GO:0005509 |
calcium ion binding |
Interacting selectively and non-covalently with calcium ions (Ca2+). |
0.0000000 |
| GO:0004930 |
G protein-coupled receptor activity |
Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. |
0.0000650 |
| GO:0043565 |
sequence-specific DNA binding |
Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. |
0.0000918 |
| GO:0008417 |
fucosyltransferase activity |
Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. |
0.0003941 |
| GO:0030248 |
cellulose binding |
Interacting selectively and non-covalently with cellulose. |
0.0005411 |
| GO:0004181 |
metallocarboxypeptidase activity |
Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. |
0.0010874 |
| GO:0004555 |
alpha,alpha-trehalase activity |
Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. |
0.0011320 |
| GO:0008134 |
transcription factor binding |
Interacting selectively and non-covalently with a transcription factor, any protein required to initiate or regulate transcription. |
0.0012305 |
| GO:0005200 |
structural constituent of cytoskeleton |
The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. |
0.0014977 |
| GO:0004252 |
serine-type endopeptidase activity |
Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). |
0.0026383 |
| GO:0008061 |
chitin binding |
Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0037881 |
| GO:0004198 |
calcium-dependent cysteine-type endopeptidase activity |
Catalysis of the hydrolysis of nonterminal peptide bonds in a polypeptide chain by a mechanism using a cysteine residue at the enzyme active center, and requiring the presence of calcium. |
0.0039050 |
| GO:0004611 |
phosphoenolpyruvate carboxykinase activity |
Catalysis of the reaction: source of phosphate + oxaloacetate = phosphoenolpyruvate + CO2 + other reaction products. |
0.0043484 |
| GO:0005267 |
potassium channel activity |
Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. |
0.0059322 |
| GO:0004983 |
neuropeptide Y receptor activity |
Combining with neuropeptide Y to initiate a change in cell activity. |
0.0059417 |
showSigOfNodes(GOdata.MF, score(MF.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 87
## Number of Edges = 103
##
## $complete.dag
## [1] "A graph with 87 nodes."
showGroupDensity(GOdata.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component
orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms
CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
CC.all[CC.all$Significant > CC.all$Expected,],
caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
| GO:0005576 |
extracellular region |
197 |
83 |
54.27 |
1 |
1.3e-06 |
1.5e-05 |
1.3e-06 |
| GO:0005743 |
mitochondrial inner membrane |
44 |
23 |
12.12 |
2 |
3.6e-05 |
0.00032 |
3.6e-05 |
| GO:0016459 |
myosin complex |
57 |
27 |
15.70 |
5 |
0.00078 |
0.00078 |
0.00078 |
| GO:0005929 |
cilium |
52 |
27 |
14.33 |
6 |
0.00088 |
0.00102 |
0.00088 |
| GO:0016021 |
integral component of membrane |
1739 |
550 |
479.08 |
3 |
0.00046 |
0.00231 |
6.1e-05 |
| GO:0005886 |
plasma membrane |
324 |
120 |
89.26 |
7 |
0.00094 |
0.00485 |
0.00072 |
| GO:0090575 |
RNA polymerase II transcription factor c… |
47 |
19 |
12.95 |
29 |
0.02464 |
0.00505 |
0.02464 |
| GO:0098800 |
inner mitochondrial membrane protein com… |
27 |
13 |
7.44 |
25 |
0.02095 |
0.00515 |
0.02095 |
| GO:0042555 |
MCM complex |
12 |
8 |
3.31 |
10 |
0.00610 |
0.00610 |
0.00610 |
| GO:0099512 |
supramolecular fiber |
52 |
23 |
14.33 |
8 |
0.00353 |
0.00812 |
0.00353 |
| GO:0031225 |
anchored component of membrane |
8 |
5 |
2.20 |
13 |
0.00903 |
0.00903 |
0.00903 |
| GO:0016020 |
membrane |
3059 |
942 |
842.73 |
30 |
0.02530 |
0.01056 |
1.5e-07 |
| GO:0030123 |
AP-3 adaptor complex |
8 |
5 |
2.20 |
15 |
0.01204 |
0.01204 |
0.01204 |
kable(
CC.all[CC.all$Significant < CC.all$Expected,],
caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CC.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
| GO:0005576 |
extracellular region |
The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. |
0.0000146 |
| GO:0005743 |
mitochondrial inner membrane |
The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. |
0.0003229 |
| GO:0016459 |
myosin complex |
A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. |
0.0007773 |
| GO:0005929 |
cilium |
A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. |
0.0010225 |
| GO:0016021 |
integral component of membrane |
The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. |
0.0023121 |
| GO:0005886 |
plasma membrane |
The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. |
0.0048464 |
| GO:0042555 |
MCM complex |
A hexameric protein complex required for the initiation and regulation of DNA replication. |
0.0060991 |
| GO:0099512 |
supramolecular fiber |
A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. |
0.0081229 |
| GO:0031225 |
anchored component of membrane |
The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. |
0.0090256 |
showSigOfNodes(GOdata.CC, score(CC.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 57
## Number of Edges = 99
##
## $complete.dag
## [1] "A graph with 57 nodes."
showGroupDensity(GOdata.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Using the portion of variance explained by regime
Building the topGO object
I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by regime. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to p-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by regime to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by regime.
selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
| BP |
26235 |
1582 |
| MF |
30580 |
791 |
| CC |
21616 |
390 |
Running the tests
BPvar.elim <- runTest(GOdataVar.BP, algorithm = "elim", statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea <- runTest(GOdataVar.BP, algorithm = "lea", statistic = "ks")
MFvar.elim <- runTest(GOdataVar.MF, algorithm = "elim", statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea <- runTest(GOdataVar.MF, algorithm = "lea", statistic = "ks")
CCvar.elim <- runTest(GOdataVar.CC, algorithm = "elim", statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea <- runTest(GOdataVar.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
| BP |
elim |
1582 |
24 |
| BP |
weight01 |
1582 |
28 |
| BP |
lea |
1582 |
32 |
| MF |
elim |
791 |
30 |
| MF |
weight01 |
791 |
29 |
| MF |
lea |
791 |
45 |
| CC |
elim |
390 |
15 |
| CC |
weight01 |
390 |
15 |
| CC |
lea |
390 |
24 |
rm(ResultsSummary)
Results
Biological process
orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms
BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
BPvar.all[BPvar.all$Significant > BPvar.all$Expected,],
caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
| 1 |
GO:0007186 |
G protein-coupled receptor signaling pat… |
302 |
42 |
29.80 |
1 |
5.1e-08 |
1.0e-09 |
5.1e-08 |
| 2 |
GO:0006030 |
chitin metabolic process |
103 |
19 |
10.16 |
2 |
6.1e-08 |
1.4e-07 |
6.1e-08 |
| 3 |
GO:0006508 |
proteolysis |
822 |
125 |
81.12 |
6 |
0.00017 |
1.7e-07 |
7.9e-05 |
| 4 |
GO:0060271 |
cilium assembly |
73 |
20 |
7.20 |
4 |
0.00012 |
8.0e-05 |
6.9e-07 |
| 6 |
GO:0006355 |
regulation of transcription, DNA-templat… |
621 |
64 |
61.28 |
3 |
0.00011 |
0.00023 |
0.00011 |
| 7 |
GO:0006813 |
potassium ion transport |
103 |
20 |
10.16 |
7 |
0.00034 |
0.00034 |
2.8e-05 |
| 8 |
GO:0071805 |
potassium ion transmembrane transport |
20 |
7 |
1.97 |
15 |
0.00401 |
0.00079 |
0.00401 |
| 9 |
GO:0005991 |
trehalose metabolic process |
22 |
10 |
2.17 |
9 |
0.00080 |
0.00080 |
8.5e-06 |
| 10 |
GO:0007154 |
cell communication |
1169 |
116 |
115.36 |
1292 |
0.96294 |
0.00085 |
0.98254 |
| 11 |
GO:0060294 |
cilium movement involved in cell motilit… |
7 |
4 |
0.69 |
11 |
0.00155 |
0.00155 |
0.00155 |
| 12 |
GO:0005992 |
trehalose biosynthetic process |
7 |
5 |
0.69 |
12 |
0.00177 |
0.00177 |
0.00177 |
| 14 |
GO:0006814 |
sodium ion transport |
108 |
14 |
10.66 |
13 |
0.00266 |
0.00266 |
0.00266 |
| 16 |
GO:0003341 |
cilium movement |
22 |
8 |
2.17 |
14 |
0.00400 |
0.00400 |
0.00013 |
| 17 |
GO:0007017 |
microtubule-based process |
389 |
59 |
38.39 |
240 |
0.14772 |
0.00411 |
0.14772 |
| 19 |
GO:0034220 |
ion transmembrane transport |
306 |
49 |
30.20 |
25 |
0.01019 |
0.00531 |
0.00092 |
| 20 |
GO:0005975 |
carbohydrate metabolic process |
360 |
55 |
35.53 |
129 |
0.06969 |
0.00535 |
0.05990 |
| 22 |
GO:0000079 |
regulation of cyclin-dependent protein s… |
9 |
2 |
0.89 |
19 |
0.00570 |
0.00570 |
0.00570 |
| 23 |
GO:0055085 |
transmembrane transport |
1062 |
152 |
104.80 |
8 |
0.00054 |
0.00605 |
0.00015 |
| 24 |
GO:0042246 |
tissue regeneration |
6 |
2 |
0.59 |
20 |
0.00624 |
0.00624 |
0.00624 |
kable(
BPvar.all[BPvar.all$Significant < BPvar.all$Expected,],
caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
| 5 |
GO:0007156 |
homophilic cell adhesion via plasma memb… |
31 |
3 |
3.06 |
5 |
0.00016 |
0.00016 |
0.00016 |
| 13 |
GO:0044271 |
cellular nitrogen compound biosynthetic … |
1365 |
99 |
134.70 |
1127 |
0.88493 |
0.00239 |
0.97663 |
| 15 |
GO:0006520 |
cellular amino acid metabolic process |
208 |
16 |
20.53 |
1035 |
0.84021 |
0.00297 |
0.98763 |
| 18 |
GO:0006464 |
cellular protein modification process |
1369 |
112 |
135.10 |
1418 |
0.98480 |
0.00417 |
0.97305 |
| 21 |
GO:0006486 |
protein glycosylation |
106 |
6 |
10.46 |
122 |
0.06463 |
0.00547 |
0.06463 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(BPvar.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with regime according to all 3 algorithms
| GO:0007186 |
G protein-coupled receptor signaling pathway |
A series of molecular signals that proceeds with an activated receptor promoting the exchange of GDP for GTP on the alpha-subunit of an associated heterotrimeric G-protein complex. The GTP-bound activated alpha-G-protein then dissociates from the beta- and gamma-subunits to further transmit the signal within the cell. The pathway begins with receptor-ligand interaction, or for basal GPCR signaling the pathway begins with the receptor activating its G protein in the absence of an agonist, and ends with regulation of a downstream cellular process, e.g. transcription. The pathway can start from the plasma membrane, Golgi or nuclear membrane (PMID:24568158 and PMID:16902576). |
0.0000000 |
| GO:0006030 |
chitin metabolic process |
The chemical reactions and pathways involving chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0000001 |
| GO:0006508 |
proteolysis |
The hydrolysis of proteins into smaller polypeptides and/or amino acids by cleavage of their peptide bonds. |
0.0000002 |
| GO:0060271 |
cilium assembly |
The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. |
0.0000796 |
| GO:0007156 |
homophilic cell adhesion via plasma membrane adhesion molecules |
The attachment of a plasma membrane adhesion molecule in one cell to an identical molecule in an adjacent cell. |
0.0001597 |
| GO:0006355 |
regulation of transcription, DNA-templated |
Any process that modulates the frequency, rate or extent of cellular DNA-templated transcription. |
0.0002253 |
| GO:0006813 |
potassium ion transport |
The directed movement of potassium ions (K+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. |
0.0003356 |
| GO:0071805 |
potassium ion transmembrane transport |
A process in which a potassium ion is transported from one side of a membrane to the other. |
0.0007875 |
| GO:0005991 |
trehalose metabolic process |
The chemical reactions and pathways involving trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. |
0.0008019 |
| GO:0060294 |
cilium movement involved in cell motility |
Movement of cilia mediated by motor proteins that contributes to the movement of a cell. |
0.0015464 |
| GO:0005992 |
trehalose biosynthetic process |
The chemical reactions and pathways resulting in the formation of trehalose, a disaccharide isomeric with sucrose and obtained from certain lichens and fungi. |
0.0017746 |
| GO:0006814 |
sodium ion transport |
The directed movement of sodium ions (Na+) into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. |
0.0026564 |
| GO:0003341 |
cilium movement |
The directed, self-propelled movement of a cilium. |
0.0039975 |
| GO:0000079 |
regulation of cyclin-dependent protein serine/threonine kinase activity |
Any process that modulates the frequency, rate or extent of cyclin-dependent protein serine/threonine kinase activity. |
0.0057002 |
| GO:0055085 |
transmembrane transport |
The process in which a solute is transported across a lipid bilayer, from one side of a membrane to the other. |
0.0060495 |
| GO:0042246 |
tissue regeneration |
The regrowth of lost or destroyed tissues. |
0.0062426 |
| GO:0007160 |
cell-matrix adhesion |
The binding of a cell to the extracellular matrix via adhesion molecules. |
0.0071326 |
| GO:0016539 |
intein-mediated protein splicing |
The removal of an internal amino acid sequence (an intein) from a protein during protein maturation; the excision of inteins is precise and the N- and C-terminal exteins are joined by a normal peptide bond. Protein splicing involves 4 nucleophilic displacements by the 3 conserved splice junction residues. |
0.0075827 |
| GO:0006821 |
chloride transport |
The directed movement of chloride into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. |
0.0080310 |
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 172
## Number of Edges = 305
##
## $complete.dag
## [1] "A graph with 172 nodes."
Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function
orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms
MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
MFvar.all[MFvar.all$Significant > MFvar.all$Expected,],
caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
| 1 |
GO:0005509 |
calcium ion binding |
638 |
124 |
64.63 |
1 |
2.5e-20 |
2.5e-20 |
2.5e-20 |
| 2 |
GO:0004930 |
G protein-coupled receptor activity |
251 |
36 |
25.43 |
2 |
4.6e-08 |
4.4e-08 |
5.5e-10 |
| 3 |
GO:0008061 |
chitin binding |
112 |
20 |
11.35 |
3 |
1.8e-07 |
1.8e-07 |
1.8e-07 |
| 4 |
GO:0043565 |
sequence-specific DNA binding |
221 |
25 |
22.39 |
5 |
9.2e-05 |
6.9e-06 |
9.2e-05 |
| 5 |
GO:0004181 |
metallocarboxypeptidase activity |
39 |
12 |
3.95 |
4 |
3.8e-05 |
3.8e-05 |
3.8e-05 |
| 6 |
GO:0003774 |
motor activity |
277 |
48 |
28.06 |
15 |
0.00353 |
8.4e-05 |
0.00353 |
| 8 |
GO:0005200 |
structural constituent of cytoskeleton |
23 |
10 |
2.33 |
7 |
0.00015 |
0.00015 |
0.00015 |
| 9 |
GO:0005267 |
potassium channel activity |
80 |
18 |
8.10 |
20 |
0.00552 |
0.00038 |
0.00011 |
| 10 |
GO:0030248 |
cellulose binding |
8 |
2 |
0.81 |
9 |
0.00084 |
0.00084 |
0.00084 |
| 11 |
GO:0004555 |
alpha,alpha-trehalase activity |
15 |
5 |
1.52 |
10 |
0.00096 |
0.00096 |
0.00096 |
| 12 |
GO:0016651 |
oxidoreductase activity, acting on NAD(P… |
47 |
5 |
4.76 |
49 |
0.02485 |
0.00150 |
0.02485 |
| 13 |
GO:0004252 |
serine-type endopeptidase activity |
143 |
29 |
14.49 |
11 |
0.00150 |
0.00150 |
0.00150 |
| 14 |
GO:0019205 |
nucleobase-containing compound kinase ac… |
67 |
12 |
6.79 |
47 |
0.02014 |
0.00157 |
0.00149 |
| 15 |
GO:0004553 |
hydrolase activity, hydrolyzing O-glycos… |
169 |
31 |
17.12 |
48 |
0.02309 |
0.00186 |
0.02309 |
| 16 |
GO:0005249 |
voltage-gated potassium channel activity |
54 |
8 |
5.47 |
21 |
0.00559 |
0.00193 |
0.00559 |
| 18 |
GO:0004867 |
serine-type endopeptidase inhibitor acti… |
20 |
4 |
2.03 |
13 |
0.00321 |
0.00321 |
0.00321 |
| 19 |
GO:0005507 |
copper ion binding |
29 |
8 |
2.94 |
14 |
0.00351 |
0.00351 |
0.00351 |
| 20 |
GO:0051015 |
actin filament binding |
47 |
11 |
4.76 |
17 |
0.00415 |
0.00415 |
0.00415 |
| 22 |
GO:0008083 |
growth factor activity |
8 |
2 |
0.81 |
19 |
0.00450 |
0.00450 |
0.00450 |
| 23 |
GO:0005272 |
sodium channel activity |
83 |
12 |
8.41 |
22 |
0.00627 |
0.00627 |
0.00627 |
| 24 |
GO:0005525 |
GTP binding |
334 |
45 |
33.84 |
24 |
0.00657 |
0.00657 |
0.00657 |
| 25 |
GO:0005201 |
extracellular matrix structural constitu… |
7 |
3 |
0.71 |
26 |
0.00755 |
0.00755 |
0.00755 |
| 26 |
GO:0031683 |
G-protein beta/gamma-subunit complex bin… |
17 |
4 |
1.72 |
27 |
0.00820 |
0.00820 |
0.00820 |
| 27 |
GO:0004550 |
nucleoside diphosphate kinase activity |
15 |
4 |
1.52 |
28 |
0.00837 |
0.00837 |
0.00837 |
| 28 |
GO:0004040 |
amidase activity |
5 |
3 |
0.51 |
29 |
0.00868 |
0.00868 |
0.00868 |
| 29 |
GO:0030246 |
carbohydrate binding |
83 |
20 |
8.41 |
31 |
0.01014 |
0.00990 |
0.02445 |
| 30 |
GO:0015267 |
channel activity |
476 |
68 |
48.22 |
543 |
0.83162 |
0.01020 |
0.00020 |
kable(
MFvar.all[MFvar.all$Significant < MFvar.all$Expected,],
caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
| 7 |
GO:0008417 |
fucosyltransferase activity |
42 |
4 |
4.25 |
6 |
0.00013 |
0.00013 |
0.00013 |
| 17 |
GO:0005515 |
protein binding |
4270 |
424 |
432.59 |
262 |
0.32261 |
0.00273 |
0.53496 |
| 21 |
GO:0008378 |
galactosyltransferase activity |
24 |
2 |
2.43 |
18 |
0.00422 |
0.00422 |
0.00422 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MFvar.weight01)[sigTerms]),
caption = paste('Molecular functions terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with regime according to all 3 algorithms
| GO:0005509 |
calcium ion binding |
Interacting selectively and non-covalently with calcium ions (Ca2+). |
0.0000000 |
| GO:0004930 |
G protein-coupled receptor activity |
Combining with an extracellular signal and transmitting the signal across the membrane by activating an associated G-protein; promotes the exchange of GDP for GTP on the alpha subunit of a heterotrimeric G-protein complex. |
0.0000000 |
| GO:0008061 |
chitin binding |
Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0000002 |
| GO:0043565 |
sequence-specific DNA binding |
Interacting selectively and non-covalently with DNA of a specific nucleotide composition, e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor binding or rDNA binding. |
0.0000069 |
| GO:0004181 |
metallocarboxypeptidase activity |
Catalysis of the hydrolysis of C-terminal amino acid residues from a polypeptide chain by a mechanism in which water acts as a nucleophile, one or two metal ions hold the water molecule in place, and charged amino acid side chains are ligands for the metal ions. |
0.0000382 |
| GO:0003774 |
motor activity |
Catalysis of the generation of force resulting either in movement along a microfilament or microtubule, or in torque resulting in membrane scission, coupled to the hydrolysis of a nucleoside triphosphate. |
0.0000839 |
| GO:0008417 |
fucosyltransferase activity |
Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. |
0.0001261 |
| GO:0005200 |
structural constituent of cytoskeleton |
The action of a molecule that contributes to the structural integrity of a cytoskeletal structure. |
0.0001488 |
| GO:0005267 |
potassium channel activity |
Enables the facilitated diffusion of a potassium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. |
0.0003783 |
| GO:0030248 |
cellulose binding |
Interacting selectively and non-covalently with cellulose. |
0.0008382 |
| GO:0004555 |
alpha,alpha-trehalase activity |
Catalysis of the reaction: alpha,alpha-trehalose + H2O = 2 D-glucose. |
0.0009596 |
| GO:0004252 |
serine-type endopeptidase activity |
Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain by a catalytic mechanism that involves a catalytic triad consisting of a serine nucleophile that is activated by a proton relay involving an acidic residue (e.g. aspartate or glutamate) and a basic residue (usually histidine). |
0.0015028 |
| GO:0005249 |
voltage-gated potassium channel activity |
Enables the transmembrane transfer of a potassium ion by a voltage-gated channel. A voltage-gated channel is a channel whose open state is dependent on the voltage across the membrane in which it is embedded. |
0.0019323 |
| GO:0004867 |
serine-type endopeptidase inhibitor activity |
Stops, prevents or reduces the activity of serine-type endopeptidases, enzymes that catalyze the hydrolysis of nonterminal peptide bonds in a polypeptide chain; a serine residue (and a histidine residue) are at the active center of the enzyme. |
0.0032063 |
| GO:0005507 |
copper ion binding |
Interacting selectively and non-covalently with copper (Cu) ions. |
0.0035076 |
| GO:0051015 |
actin filament binding |
Interacting selectively and non-covalently with an actin filament, also known as F-actin, a helical filamentous polymer of globular G-actin subunits. |
0.0041496 |
| GO:0008378 |
galactosyltransferase activity |
Catalysis of the transfer of a galactosyl group to an acceptor molecule, typically another carbohydrate or a lipid. |
0.0042160 |
| GO:0008083 |
growth factor activity |
The function that stimulates a cell to grow or proliferate. Most growth factors have other actions besides the induction of cell growth or proliferation. |
0.0044962 |
| GO:0005272 |
sodium channel activity |
Enables the facilitated diffusion of a sodium ion (by an energy-independent process) involving passage through a transmembrane aqueous pore or channel without evidence for a carrier-mediated mechanism. |
0.0062678 |
| GO:0005525 |
GTP binding |
Interacting selectively and non-covalently with GTP, guanosine triphosphate. |
0.0065740 |
| GO:0005201 |
extracellular matrix structural constituent |
The action of a molecule that contributes to the structural integrity of the extracellular matrix. |
0.0075499 |
| GO:0031683 |
G-protein beta/gamma-subunit complex binding |
Interacting selectively and non-covalently with a complex of G-protein beta/gamma subunits. |
0.0082007 |
| GO:0004550 |
nucleoside diphosphate kinase activity |
Catalysis of the reaction: ATP + nucleoside diphosphate = ADP + nucleoside triphosphate. |
0.0083664 |
| GO:0004040 |
amidase activity |
Catalysis of the reaction: a monocarboxylic acid amide + H2O = a monocarboxylate + NH3. |
0.0086755 |
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 121
## Number of Edges = 157
##
## $complete.dag
## [1] "A graph with 121 nodes."
I plot variance portion, but for the term found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component
orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms
CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))
kable(
CCvar.all[CCvar.all$Significant > CCvar.all$Expected,],
caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
| 1 |
GO:0005576 |
extracellular region |
197 |
37 |
19.46 |
1 |
1.4e-09 |
3.8e-09 |
7.8e-11 |
| 2 |
GO:0016021 |
integral component of membrane |
1739 |
215 |
171.76 |
2 |
8.9e-07 |
4.1e-06 |
3.7e-08 |
| 3 |
GO:0005886 |
plasma membrane |
324 |
44 |
32.00 |
4 |
0.00031 |
6.0e-05 |
3.2e-05 |
| 4 |
GO:0016459 |
myosin complex |
57 |
19 |
5.63 |
3 |
6.2e-05 |
6.2e-05 |
6.2e-05 |
| 5 |
GO:0016020 |
membrane |
3059 |
361 |
302.14 |
7 |
0.00116 |
6.7e-05 |
1.8e-09 |
| 6 |
GO:0005743 |
mitochondrial inner membrane |
44 |
13 |
4.35 |
5 |
0.00037 |
9.3e-05 |
0.00037 |
| 7 |
GO:0008076 |
voltage-gated potassium channel complex |
30 |
5 |
2.96 |
6 |
0.00059 |
0.00059 |
0.00059 |
| 8 |
GO:0005929 |
cilium |
52 |
17 |
5.14 |
8 |
0.00289 |
0.00188 |
0.00072 |
| 9 |
GO:0031012 |
extracellular matrix |
11 |
3 |
1.09 |
9 |
0.00323 |
0.00323 |
0.00323 |
| 10 |
GO:0031225 |
anchored component of membrane |
8 |
4 |
0.79 |
10 |
0.00450 |
0.00450 |
0.00450 |
| 11 |
GO:0099512 |
supramolecular fiber |
52 |
13 |
5.14 |
12 |
0.00708 |
0.00507 |
0.00708 |
| 12 |
GO:0030990 |
intraciliary transport particle |
5 |
2 |
0.49 |
11 |
0.00586 |
0.00586 |
0.00586 |
| 14 |
GO:0005581 |
collagen trimer |
7 |
3 |
0.69 |
13 |
0.00723 |
0.00723 |
0.00723 |
| 15 |
GO:0005858 |
axonemal dynein complex |
10 |
3 |
0.99 |
14 |
0.00783 |
0.00783 |
0.00783 |
kable(
CCvar.all[CCvar.all$Significant < CCvar.all$Expected,],
caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
| 13 |
GO:0090575 |
RNA polymerase II transcription factor c… |
47 |
3 |
4.64 |
99 |
0.23664 |
0.00719 |
0.23664 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CCvar.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with regime according to all 3 algorithms
| GO:0005576 |
extracellular region |
The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. |
0.0000000 |
| GO:0016021 |
integral component of membrane |
The component of a membrane consisting of the gene products and protein complexes having at least some part of their peptide sequence embedded in the hydrophobic region of the membrane. |
0.0000041 |
| GO:0005886 |
plasma membrane |
The membrane surrounding a cell that separates the cell from its external environment. It consists of a phospholipid bilayer and associated proteins. |
0.0000603 |
| GO:0016459 |
myosin complex |
A protein complex, formed of one or more myosin heavy chains plus associated light chains and other proteins, that functions as a molecular motor; uses the energy of ATP hydrolysis to move actin filaments or to move vesicles or other cargo on fixed actin filaments; has magnesium-ATPase activity and binds actin. Myosin classes are distinguished based on sequence features of the motor, or head, domain, but also have distinct tail regions that are believed to bind specific cargoes. |
0.0000621 |
| GO:0016020 |
membrane |
A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it. |
0.0000667 |
| GO:0005743 |
mitochondrial inner membrane |
The inner, i.e. lumen-facing, lipid bilayer of the mitochondrial envelope. It is highly folded to form cristae. |
0.0000927 |
| GO:0008076 |
voltage-gated potassium channel complex |
A protein complex that forms a transmembrane channel through which potassium ions may cross a cell membrane in response to changes in membrane potential. |
0.0005872 |
| GO:0005929 |
cilium |
A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. |
0.0018827 |
| GO:0031012 |
extracellular matrix |
A structure lying external to one or more cells, which provides structural support, biochemical or biomechanical cues for cells or tissues. |
0.0032341 |
| GO:0031225 |
anchored component of membrane |
The component of a membrane consisting of the gene products that are tethered to the membrane only by a covalently attached anchor, such as a lipid group that is embedded in the membrane. Gene products with peptide sequences that are embedded in the membrane are excluded from this grouping. |
0.0044950 |
| GO:0099512 |
supramolecular fiber |
A polymer consisting of an indefinite number of protein or protein complex subunits that have polymerised to form a fiber-shaped structure. |
0.0050745 |
| GO:0030990 |
intraciliary transport particle |
A nonmembrane-bound oligomeric protein complex that participates in bidirectional transport of molecules (cargo) along axonemal microtubules. |
0.0058559 |
| GO:0005581 |
collagen trimer |
A protein complex consisting of three collagen chains assembled into a left-handed triple helix. These trimers typically assemble into higher order structures. |
0.0072277 |
| GO:0005858 |
axonemal dynein complex |
A dynein complex found in eukaryotic cilia and flagella; the motor domain heads interact with adjacent microtubules to generate a sliding force which is converted to a bending motion. |
0.0078297 |
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
firstSigNodes = sum(significant.weight01),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 76
## Number of Edges = 135
##
## $complete.dag
## [1] "A graph with 76 nodes."
For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.
showGroupDensity(GOdataVar.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Comparison between the two ordering of genes.
Biological process
allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% BP.pvalue.sigTerms,
Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(BP.weight01))[allTerms],
Variance = rank(score(BPvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of BP terms by significance')

Molecular function
allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% MF.pvalue.sigTerms,
Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(MF.weight01))[allTerms],
Variance = rank(score(MFvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of MF terms by significance')

Cellular component
allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% CC.pvalue.sigTerms,
Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(CC.weight01))[allTerms],
Variance = rank(score(CCvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of CC terms by significance')

Session info
I save the main variables to be able to load them in a new R session later.
save(allgenes2GO,
GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Rgraphviz_2.30.0 ggplot2_3.2.1 limma_3.42.0
## [4] knitr_1.27 topGO_2.38.1 SparseM_1.78
## [7] GO.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.2
## [10] S4Vectors_0.24.3 Biobase_2.46.0 graph_1.64.0
## [13] BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.12 purrr_0.3.3 splines_3.6.2
## [5] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.1 htmltools_0.4.0
## [9] yaml_2.2.0 mgcv_1.8-31 blob_1.2.1 rlang_0.4.2
## [13] pillar_1.4.3 glue_1.3.1 withr_2.1.2 DBI_1.1.0
## [17] bit64_0.9-7 matrixStats_0.55.0 lifecycle_0.1.0 stringr_1.4.0
## [21] munsell_0.5.0 gtable_0.3.0 memoise_1.1.0 evaluate_0.14
## [25] labeling_0.3 highr_0.8 Rcpp_1.0.3 backports_1.1.5
## [29] scales_1.1.0 farver_2.0.3 bit_1.1-15.1 digest_0.6.23
## [33] stringi_1.4.5 dplyr_0.8.3 tools_3.6.2 magrittr_1.5
## [37] lazyeval_0.2.2 tibble_2.1.3 RSQLite_2.2.0 crayon_1.3.4
## [41] pkgconfig_2.0.3 zeallot_0.1.0 Matrix_1.2-18 assertthat_0.2.1
## [45] rmarkdown_2.1 R6_2.4.1 nlme_3.1-143 compiler_3.6.2